Systematic Methodology With DFT Rules Reduces Fault-Coverage Analysis
By Luke L. Chang
Integrated System Design
Posted 08/03/01, 10:36:58 AM EDT
A wealth of information has been published on the design-for-test (DFT) rules that are necessary for high fault coverage. But what if the initial coverage is unacceptably low? For such cases, hardly any information is available today on how to analyze the results systematically and quickly, or on how to increase the coverage percentage. These matters are important because ASIC designers routinely perform such tasks.
This two-part discussion (Part 2 will appear in September) considers fault-coverage analysis and simulation for full scan testing of ASIC designs. However, those topics are equally applicable to other types of IC design, such as FPGAs.
Scan testing is widely used by ASIC vendors to detect manufacturing defects that are generally called stuck-at faults. In an ASIC, there could be many failures; they are grouped together based on their detrimental logical effect on the design. Stuck-at faults are logical fault models that serve as the basis for testing algorithms. Stuck-at-0 refers to a net segment being stuck at logic 0, stuck-at-1 to logic 1. If such a fault has a detrimental effect on the design and if that effect is detectable, then the purpose of scan testing is to automatically generate a scan vector or vectors to detect it by using an automatic-test-pattern-generation (ATPG) tool. It will be pointed out later that not all stuck-at faults have an effect, such as those in redundant logic, and that not all stuck-at faults are detectable. The purpose of fault-coverage analysis is to accurately calculate the percentage of the total faults that are detectable and to come up with ways to cover any uncovered faults so that the final coverage percentage is 95 percent or higher. That is the standard used by ASIC vendors to prove that the circuit is physically manufactured correctly.
Full scan testing starts with replacing regular flops with scan flops. The design is additionally modified by the addition of logic to prevent the assertion of sets and resets. The scan flops are connected into one or more scan chains so that test data (scan vectors) can be shifted onto each scan flop, after which an ATPG tool is run to automatically generate scan vectors that cover as many faults as possible. These tasks are usually performed by ASIC designers. Scan vectors are first simulated against expected responses (also generated by the ATPG tool) and then sent to ASIC vendors to run on the chip tester. Figure 1 shows an example of scan testing: a single scan chain (dotted line) consisting of four scan flops (muxscan). This chain originates at scan input scan_in and ends at scan output scan_out. input1 and input2 are primary inputs and output1 and output2 primary outputs; scanflop1 through scanflop4 are scan flops. L1 through L4 are combinational logic blocks. On a scan flop, input pin d is the logic pin and si the scan pin. Input pin sel selects either d or si to be clocked by the clk pin to the output logic pin q. Output pin qn is just q inverted.
This simple example illustrates the essence of scan testing. For a fault on an internal net to be detected, the ATPG tool must be able to control the value to put on that net as well as to propagate that value to either a primary output pin or the scan output pin for observation. If a fault is not detectable, then one or both of these requirements is not met.
Over the years, many DFT rules have been developed for high fault coverage; if they are followed, scan-vector generation should be straightforward and very fast. However, that is a big if -- many designers still do not pay enough attention to them. As a result, the initial fault coverage averages about 75 percent to 85 percent, less than the required 95 percent or higher. Fault-coverage analysis studies why a fault is not detectable and tries to find ways to detect it. However, this is usually a mammoth challenge because uncovered faults can number in the thousands even for so-called small designs. Our task is to study every one of them and find ways to cover them in days instead of months!
In this article, practical methods are presented to systematically analyze stuck-at faults not covered by full scan, with real-world design examples. Discussion will be centered around important DFT rules that, if followed, greatly increase fault coverage.
Fault-coverage analysis is based on single stuck-at faults. This means that if a pin of an internal gate is stuck at 0 or 1, the scan vectors can propagate this fault to the outputs for observation. This is not the same as design verification test (DVT), which studies the design percentage that has been functionally verified and exposes flaws that do not meet specifications. For example, if the OR gate in Figure 1 is mistakenly replaced by a NOR gate, scan vectors cannot reveal this design flaw. Also, fault coverage is not toggle coverage; the latter is part of DVT and involves running simulation on the design while a toggle tool tracks the number of times each line of RTL code is exercised. Its main purpose is to reveal the logic that has not been “touched” by logic verification so that new tests can be written to verify it. Finally, scan testing does not check circuit performance (timing), which must be studied with static/dynamic timing analysis.
DFT rules
A common ASIC design methodology starts with an architectural specification for its functions and design partitions. This is followed by implementation in a hardware description language, such as Verilog or VHDL, or a general-purpose programming language, such as C or C++, to generate a design netlist. This netlist is then turned into a gate-level design by synthesis. Most synthesis tools are coupled with ATPG tools so that testability problems can be analyzed during synthesis instead of afterward. Another, older, approach is to postpone testability study until synthesis completes. Either way, one should face the same fault-coverage problems.
A design has a fixed number of possible faults, since each net can have at most two faults: stuck-at-0 and stuck-at-1. Fault coverage is defined as follows: Fault Coverage = Detected Faults/Total Faults.
It is obvious that to get high coverage, the numerator must be as large as possible since the denominator is fixed. Commercially available ATPG tools can generate scan vectors for very large designs in 30 minutes or less. However, they cannot provide high coverage unless a design follows DFT rules.
After initial testability study (during or after synthesis), every possible fault in the design should fall into one of the following three categories:
- Detected Faults: should be the majority (75 percent to 99 percent) of Total Faults. The test tool detects them within the given constraints, such as run-time limits. If the percentage is 95 percent or more, there is still a need to perform fault-coverage analysis.
- Undetected Faults: should be a very small part of Total Faults, assuming the design is suitable for full scan testing. Almost all should be of type ABORTED, which means the test tool has given up trying to detect these faults because of run-time limits. Some fault locations with massive fan-in or fan-out can be aborted. If run-time limits are relaxed, these faults will usually become untestable (explained below). Another type is UNTESTED, which means ATPG is never performed on them, but this type should not show up for full scan testing.
- Untestable Faults: should be a small part (1 percent to 25 percent) of Total Faults. These faults cannot be detected but may be detected if the current design and/or scan constraints setup are changed. They are the single biggest reason for low fault coverage. Some untestable faults cannot be tested no matter what, such as those in redundant logic.
An ASIC design is usually made up of several major modules, such as core module, PLL module, I/O pad module, test access port (TAP) controller, delay module, etc. The core module is usually developed by the ASIC designer, and its fault coverage concerns us the most, excluding RAM BIST submodules and redundant logic. The I/O pad module contains mainly I/O pad cells with boundary-scan cells that also are called boundary-scan registers, or BSRs, and the PLL module houses the PLL circuit and the associated test logic. The delay module can be 100 inverters connected in a single chain so that its propagation delay can be measured during chip fabrication to test the silicon process parameters.
For full scan, the internal scan chain contains only the flops inside the core module and ATPG covers mainly its logic. Therefore, other modules will have low or almost no coverage. The delay module and TAP controller are tested with vectors that are usually created by non-scan methods. The PLL module is tested with special PLL tests and is never part of scan testing. The I/O pad module contains BSRs, BSR paths and PLL paths. During ATPG, the PLL is not tested. Also, BSRs are forced into the through mode (pi pin is directly connected to po pin). As a result, coverage for this module is also low. BSRs should be covered by parametric test vectors that are generated by the ASIC vendor from the design's boundary-scan-description-language files.
Testing steps
Having said that, we need to focus on analyzing Untestable Faults in the core module. The first step is to use the ATPG tool to generate fault-coverage numbers for all the submodules. Table 1 shows nine of them for a design (numbered 1 through 9), and their initial fault-coverage percentages. Note that this table does not show any faults in the core module itself, but they must be analyzed too if some of them are untestable. Normally, the core module has no or very little logic (for synthesis efficiency) and there should be fewer than 100 faults in it.
From Table 1, we can see that Submodule 7 has the worst coverage, 13.02 percent, which should be the focus of our analysis as well as fault simulation effort, if necessary. Submodules 1, 3, 5, 6 and 8 all have less than 90 percent coverage and they must be looked at also. What we have done here is find a starting point for the fault-coverage analysis; we have also spread a large number of Untestable Faults among several submodules to make our work more manageable. For example, there are 11,214 Untestable Faults in Table 1, but there are only 628 in Submodule 7, which we will study first.
Untestable Faults can be divided into several categories, which are explained in Table 2. Their meanings will become clearer as we analyze examples of them.
ATPG tools can be used to generate a list of all Untestable Faults in a design. From this list, one can use the grep utility that is available on almost all operating systems to get all those in a particular submodule by simply “grepping” -- searching for strings in files -- for the submodule name. One such fault in Submodule 7 is:
u50/b1 S-A-0 UNTESTABLE (UNOBSRV_UNTESTABLE)
This says that the stuck-at-0 fault on pin b1 of instance u50 is untestable. To be more specific, it is UNOBSRV_UNTESTABLE (we will see why later). Finally, one should grep out exactly 628 Untestable Faults for Submodule 7, because that is the count given in Table 1. To further facilitate the analysis, divide them into the categories shown in Table 2, again by using the grep utility. It turns out that they fall into only three categories, as is shown in Table 3.
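The grep steps described above can also be scripted. The following is a minimal Python sketch, assuming the fault list uses the line format quoted in this article and assuming, hypothetically, that each pin path is prefixed with the submodule instance name (for example `submod7/u50/b1`). The real report format depends on the ATPG tool, and lines that do not match the assumed format are simply skipped.

```python
import re
from collections import Counter

# Assumed line format: "<pin path> S-A-<0|1> UNTESTABLE (<CATEGORY>)".
FAULT_RE = re.compile(r"^(\S+)\s+S-A-([01])\s+UNTESTABLE\s+\((\w+)\)\s*$")

def categorize(lines, submodule):
    """Count the untestable faults of one submodule by category."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.match(line.strip())
        # Keep only faults whose path starts with the submodule name.
        if m and m.group(1).startswith(submodule + "/"):
            counts[m.group(3)] += 1
    return counts

# Hypothetical report excerpt; the "submod7/" and "submod3/" prefixes are made up.
report = [
    "submod7/u50/b1 S-A-0 UNTESTABLE (UNOBSRV_UNTESTABLE)",
    "submod7/u50/a1 S-A-0 UNTESTABLE (UNCNTRL_UNTESTABLE)",
    "submod7/u50/a1 S-A-1 UNTESTABLE (UNCNTRL_UNTESTABLE)",
    "submod3/u12/z S-A-1 UNTESTABLE (ATG_UNTESTABLE)",
]
counts = categorize(report, "submod7")
print(dict(counts))
```

The per-category totals produced this way play the role of Table 3 for the chosen submodule.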
Let's start with the 16 UNOBSRV_UNTESTABLE faults. One of them is the stuck-at-0 fault on pin b1 of instance u50, as is shown above. Figure 2 shows the actual design logic. In this figure, the dotted line is the scan chain. Instance shared_mem is a BIST module that contains an embedded RAM and its associated BIST logic. For ATPG testing, RAMs are modeled as black boxes through which no data can be pushed. As a result, the stuck-at-0 fault on pin b1 of u50 is untestable.
From Figure 2, it is also clear that instance u50 should have UNCNTRL_UNTESTABLE faults, which are given below:
u50/a1 S-A-0 UNTESTABLE (UNCNTRL_UNTESTABLE)
u50/a1 S-A-1 UNTESTABLE (UNCNTRL_UNTESTABLE)
u50/b1 S-A-1 UNTESTABLE (UNCNTRL_UNTESTABLE)
=u50/zn S-A-0
=scanflop2/d S-A-0
We will explain soon what the = sign means. Remember that Table 3 says Submodule 7 has 144 such faults, five of them on instance u50. We have already explained why the stuck-at-0 and stuck-at-1 faults on pin a1 are UNCNTRL_UNTESTABLE; now let's see why the stuck-at-1 fault on pin b1 is so. To test this fault, we must be able to put a 0 there, which can be done. But since u50 is a NOR gate, pin a1 must be set to 1 (which is inverted to 0) for the effect of a 0 on pin b1 to be seen at pin zn. Since pin a1 cannot be controlled, the stuck-at-1 fault on pin b1 is reported as UNCNTRL_UNTESTABLE. Also, to test the stuck-at-0 fault on pin zn, we must put a 1 there, which means we must be able to put a 0 on pin b1 (which can be done) and a 1 on pin a1 (which cannot be controlled). So the stuck-at-0 fault on zn is equivalent to the stuck-at-1 fault on b1 as well as to the stuck-at-0 fault on pin d of scanflop2. Therefore, these three faults are equivalent and can be collapsed into one. This is indicated by the = sign. Two or more faults are equivalent if all tests that detect one also detect the other or others.
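To make the collapsed notation concrete, here is a small Python sketch that groups each reported fault with the = lines that follow it. It assumes the listing format shown in this article and a well-formed listing (an = line never comes first). The function name is hypothetical and is not part of any ATPG tool.

```python
def group_equivalent_faults(lines):
    """Group each reported fault with the '=' lines that follow it."""
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            # Equivalent fault: attach it to the most recent representative.
            pin, stuck = line[1:].split()[:2]
            groups[-1].append((pin, stuck))
        else:
            pin, stuck = line.split()[:2]
            groups.append([(pin, stuck)])
    return groups

listing = """
u50/a1 S-A-0 UNTESTABLE (UNCNTRL_UNTESTABLE)
u50/a1 S-A-1 UNTESTABLE (UNCNTRL_UNTESTABLE)
u50/b1 S-A-1 UNTESTABLE (UNCNTRL_UNTESTABLE)
=u50/zn S-A-0
=scanflop2/d S-A-0
""".splitlines()

groups = group_equivalent_faults(listing)
collapsed = len(groups)                    # one entry per representative fault
uncollapsed = sum(len(g) for g in groups)  # every individual fault
print(collapsed, uncollapsed)  # 3 5
```

The listing collapses to three representative faults but accounts for five individual faults, which matches the count of five given in the text.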
At this point, one UNOBSRV_UNTESTABLE and five UNCNTRL_UNTESTABLE faults in Submodule 7 have been understood. We can safely conclude that the ATPG tool has been set up and run correctly, because all those faults are indeed untestable. The study performed so far reveals several useful guidelines for systematic fault-coverage analysis:
- Run the ATPG tool once to get an initial fault coverage percentage for the design.
- Identify the module or modules that full scan is supposed to test. Usually, it is the core module.
- If the initial fault coverage is lower than 95 percent, identify all the submodules and use the ATPG tool to generate the coverage percentage for each submodule. This is shown in Table 1.
- For each submodule, use the ATPG tool to generate a list of all the Untestable Faults, which are usually the single most important cause for low fault coverage. Start with the submodule with the lowest coverage.
- Divide all Untestable Faults in a submodule into those categories given in Table 2. We did this for Submodule 7 and the results are in Table 3.
- Pick one untestable fault in one category of the Untestable Faults and start to trace the logic surrounding the pin this fault is on. Often, one untestable fault can point out other untestable faults, such as what we have seen in Figure 2. This way, you can quickly and systematically analyze all the untestable faults in a submodule.
- Finish studying all the untestable faults in one submodule before moving on to the next, until all submodules are studied. Often, analysis of untestable faults in one submodule also explains the cause for those in other submodules. It is almost like an avalanche: Analysis of the first untestable fault is slow, but as time goes by, more and more faults will be analyzed more and more quickly.
- After each submodule is analyzed, you need to find ways to increase its fault coverage. If the coverage of each submodule is raised to 95 percent or higher, the coverage for the entire design will automatically become that high. We will discuss ways to raise fault coverage towards the end of this section.
This methodology ensures that you can analyze all untestable faults in a design in days instead of months. For example, Figure 2 alone explains why six untestable faults are that way.
Having presented a fault-coverage-analysis methodology, we will discuss two topics in parallel. First, we will give examples of the remaining categories of Untestable Faults in Table 2. Second, we will use those examples to identify some DFT rules that ASIC designers can follow to have few untestable faults to begin with.
The last type of untestable faults in Submodule 7 is DANGLING_UNTESTABLE (see Table 3). Twelve of them are related and listed below:
u110/i0 S-A-0 UNTESTABLE (DANGLING_UNTESTABLE)
u110/i0 S-A-1 UNTESTABLE (DANGLING_UNTESTABLE)
u110/i1 S-A-0 UNTESTABLE (DANGLING_UNTESTABLE)
u110/i1 S-A-1 UNTESTABLE (DANGLING_UNTESTABLE)
u110/s S-A-1 UNTESTABLE (DANGLING_UNTESTABLE)
u110/s S-A-0 UNTESTABLE (DANGLING_UNTESTABLE)
u110/zn S-A-1 UNTESTABLE (DANGLING_UNTESTABLE)
=u133/i S-A-1
=u133/zn S-A-0
u110/zn S-A-0 UNTESTABLE (DANGLING_UNTESTABLE)
=u133/i S-A-0
=u133/zn S-A-1
Instances u110 and u133 are shown in Figure 3.
The analysis we have done in Figures 2 and 3 produces our first DFT rule: Scan flops should be used to directly interface with a BIST module (that instantiates embedded RAMs and BIST logic), both at its inputs and outputs.
Figure 4 shows a design that violates this rule in general.
It is worthwhile pointing out now that we make this DFT Rule 1 for a big reason. For most real-world full scan designs, most of the untestable faults are caused by DFT Rule 1 not being followed. This is because embedded RAMs usually have many input and output pins (wide data and address buses), and each pin interfaces with identical nonscan logic. Since each pin of a gate has two possible faults, the number of untestable faults adds up very quickly, especially if an ASIC design uses many embedded RAMs.
Another kind of DANGLING_UNTESTABLE fault is the two faults on the qn pin of scan flops, when qn is used to connect the internal scan chain. Qn is used so that no extra loading will be created on the q pin (logic pin) due to the scan chain, an approach that is used in Figure 1. Some ATPG tools report faults on qn pins used this way as DANGLING_UNTESTABLE. Qn pins can be used in the following ways:
- Qn pin is solely used to connect the scan chain. If your ATPG tool reports faults on it as untestable, then add them to Detected Faults: Fault Coverage = (Detected Faults + Faults on qn Pins)/Total Faults. This increases Fault Coverage because the numerator is larger. We can do this because any faults on qn pins will be automatically detected by scan testing, since they will cause the scan chain to malfunction.
- Qn is used as a logic pin, such as in clock divider circuitry. Its faults must be covered.
- Qn is not used at all. If your ATPG tool reports faults on it as untestable, then subtract them from Total Faults. The coverage equation becomes: Fault Coverage = Detected Faults/(Total Faults - Faults on qn Pins). This increases Fault Coverage, since the denominator is smaller.
The fact that many designs use flops with qn pins calls for our second DFT rule: If qn pins of scan flops are not used, subtract their faults from Total Faults. If they are used to form the scan chain, add their faults to Detected Faults. If they are used as logic pins, then their faults must be covered. Note that this rule is mainly a guideline for fault-coverage analysis and should not affect the design. But the designer needs to think about the effect of qn pins on scan testing. This effect is often ignored.
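The qn bookkeeping above amounts to two adjustments of the coverage formula. The sketch below applies both at once, on the assumption that a design may contain some qn pins used only for the scan chain and others left unconnected. The function name, parameter names and fault counts are hypothetical and chosen only for illustration.

```python
def adjusted_coverage(detected, total, qn_scan_only=0, qn_unused=0):
    """Fault coverage after the qn-pin bookkeeping of DFT Rule 2.

    qn_scan_only: reported-untestable faults on qn pins used only for the
                  scan chain (credited as detected).
    qn_unused:    reported-untestable faults on qn pins that are not used
                  at all (removed from the total).
    """
    return (detected + qn_scan_only) / (total - qn_unused)

# Hypothetical counts, for illustration only.
raw = adjusted_coverage(9000, 10000)
adj = adjusted_coverage(9000, 10000, qn_scan_only=200, qn_unused=100)
print(f"{raw:.2%} -> {adj:.2%}")  # 90.00% -> 92.93%
```

Faults on qn pins used as logic pins get no adjustment, since they must be covered like any other fault.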
From the analysis above, we can infer that faults on unused output pins of internal cells must also be DANGLING_UNTESTABLE, and there is no way to remedy that. Hence our third DFT rule: Try to reduce the number of unused output pins on internal gates. Although faults on them are unimportant, they are reported by the ATPG tool as untestable. Therefore, it takes time to identify them. Subtract them from Total Faults.
From now on, we will give examples of the remaining categories of untestable faults in Table 2. The first is an ATG_UNTESTABLE fault: the stuck-at-1 fault on pin a1 of instance u1965 in Figure 5. It turns out that pin z of instance u1224 has a huge fan-out, but only a few of the destination pins (including a1 of u1965) have Untestable Faults. This means that the ATG_UNTESTABLE fault is caused locally rather than globally. So focus on the logic that directly interfaces to instance u1965 instead of on all destination cells of u1224's pin z.
Redundant logic is logic that can always be simplified by removing at least one gate or one gate input. It is mainly caused by inefficient optimization in the synthesis tool, or it can be put into a design to improve performance. Redundant logic consumes area and power and it reduces fault coverage, as has been shown. In static timing analysis, all redundant logic paths must be analyzed and the worst timing path must meet the requirement. Currently, ATPG is the best way to identify such logic, although it does not do so automatically. Therefore, DFT Rule 4 should be kept in mind: Try to optimize your synthesis so that no, or little, redundant logic is produced; try not to intentionally introduce it into your design. Such Untestable Faults must be manually identified and can be subtracted from Total Faults because they are untestable no matter what.
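To see why faults in redundant logic are untestable no matter what, consider a tiny hypothetical circuit that is not taken from the figures of this article: z = a OR (a AND b). The AND term is redundant because z always equals a. The Python sketch below exhaustively compares the good circuit with a copy that has a stuck-at fault on input b.

```python
from itertools import product

def circuit(a, b, b_stuck=None):
    """z = a OR (a AND b); the AND term is redundant because z always equals a."""
    if b_stuck is not None:
        b = b_stuck  # inject a stuck-at fault on input b
    return a | (a & b)

def detecting_vectors(b_stuck):
    """Input vectors whose output differs between the good and faulty circuit."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if circuit(a, b) != circuit(a, b, b_stuck)]

print(detecting_vectors(0), detecting_vectors(1))  # [] []
```

No input vector distinguishes the faulty circuit from the good one for either fault on b, so no ATPG tool can generate a test for them.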
Not all ATG_UNTESTABLE faults are caused by redundant logic. In Figure 6, the two faults on pin s of instance u1365 are ATG_UNTESTABLE and they are caused by DFT Rule 1 not being followed.
Next, we will look at an example of TIE_UNCNTRL_UNTESTABLE fault (see Table 2). It usually occurs on input pins of a gate that are tied to VDD or GND, such as the following instance taken from a gate-level Verilog netlist:
ad02d1 dbladd0 (.a0(add1[0]), .b0(add2[0]), .a1(add1[1]), .b1(add2[1]),
                .ci(1'b0), .s0(sum[0]), .s1(sum[1]), .co(c2));
Instance dbladd0 is part of an adder block, and its carry-in pin ci is permanently tied to GND. As a result, the test tool cannot control the value on it. If a pin is tied to 0, then a stuck-at-0 fault cannot be tested because the pin cannot be set to a 1. Therefore, this fault should be subtracted from the Total Faults since it is truly untestable. But the stuck-at-1 fault on this pin can still be tested (detected). The same kind of reasoning applies to pins tied to 1. Since pins tied to a constant value reduce fault coverage, we give the following DFT Rule 5: Try not to tie any primary I/O or internal cell pins (input or output) to a constant value. Some faults on them cannot be covered by ATPG or fault simulation.
Functional tests, if run on a chip tester, may detect such faults. For example, if the carry-in pin ci of the above example is stuck at 1, then functional tests can detect that the adder no longer adds correctly.
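The asymmetry between the two faults on a tied pin can be checked with a short simulation. The sketch below uses a behavioral 2-bit adder as a stand-in for the dbladd0 instance. That model is an assumption based on the pin names, not the vendor's cell definition.

```python
from itertools import product

def adder2(a, b, ci):
    """Behavioral 2-bit adder with carry-in; returns (sum, carry_out)."""
    total = a + b + ci
    return total & 0b11, total >> 2

TIED_CI = 0  # ci is hardwired to GND in the netlist

def detecting_vectors(ci_stuck):
    """Operand pairs for which a stuck-at fault on ci changes the outputs."""
    return [(a, b) for a, b in product(range(4), repeat=2)
            if adder2(a, b, TIED_CI) != adder2(a, b, ci_stuck)]

sa0 = detecting_vectors(0)  # stuck-at-0 equals the tied value: never visible
sa1 = detecting_vectors(1)  # stuck-at-1 changes the result for every operand pair
print(len(sa0), len(sa1))  # 0 16
```

In this model, a stuck-at-0 on ci is indistinguishable from the fault-free circuit, while a stuck-at-1 corrupts every addition, which is why functional tests can catch it.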
Another example of TIE_UNCNTRL_UNTESTABLE faults is the nonscan flop, such as the latch and any flop not in the internal scan chain. For ATPG testing, they are assigned a model type of ALWAYS0 or ALWAYS1, depending on which value is appropriate for overall ATPG coverage. For example, if the output of a nonscan flop goes to a two-input AND gate whose other input comes from testable logic, then that flop will be modeled as ALWAYS1, making it possible to test the other input. Modeling a nonscan flop that way is the same as tying its output to a constant value.
So DFT Rule 6 is obvious, but is provided for completeness' sake: The number of nonscan flops in a full scan design should be 0 or very few. Faults on these gates are real and must be covered by other means, such as fault simulation.
Next, we will give an example of CONS_UNCNTRL_UNTESTABLE faults, such as the stuck-at-1 fault on pin a2 of instance u1138 in Figure 7.
A quick explanation of the difference between CONS_UNCNTRL_UNTESTABLE and TIE_UNCNTRL_UNTESTABLE faults is in order. The former are on pins that are “constrained” to a constant value by ATPG setup requirements during scan testing -- during normal functional operation, they may have opposite values. The latter are on pins that are hardwired to a constant value as required by the design specification.
Let's now look at a CONS_UNOBSRV_UNTESTABLE fault, which is the stuck-at-1 fault on pin a1 of instance u7 in Figure 8.
CONS_UNOBSRV_UNTESTABLE and CONS_UNCNTRL_UNTESTABLE faults are caused by constraints in ATPG setup, but are real faults. They must be covered by other means, such as fault or functional simulation or both.
At this point, we have studied the Untestable Faults that are most commonly seen in real-world ASIC designs. The analysis produces several important DFT rules necessary for high fault coverage. There are a few other DFT rules that are also important and they will be briefly discussed next.
DFT Rule 7: If the back-end process inserts scan-related gates into a top module (such as the core module) that instantiates one or more submodules, then treat faults on those gates as “detected.” The top module should have no, or very little, design logic.
Most of these gates are inserted as buffers to distribute scan-testing signals or multiplexers to choose between the PLL clock or scan clock to go to the core module. Most uncovered faults on them can be treated as detected, because scan testing would produce errors if those gates have faults. However, there should be some functional test vectors for the multiplexers so that we can test whether they work properly when selecting the PLL clock.
DFT Rule 8: If you use negative edge-triggered flops (which is a violation of full scan rules), think beforehand about how to test them.
In Figure 9, neg_flop1 is a negative edge-triggered flop. L1 through L4 are combinational logic blocks. Input1 and output1 are primary input and output, respectively.
Unfortunately, not all negative edge-triggered flops can be treated as buffers, such as neg_flop1 and neg_flop2 in Figure 10. The solutions can be any one of the following:
- Take neg_flop1 and neg_flop2 out of the scan chain, which means combinational logic blocks L1 and L2 are not testable, reducing coverage.
- Put neg_flop1 and neg_flop2 into a separate scan chain and model them as negative edge-triggered flops.
- Put neg_flop1 and neg_flop2 at the beginning of the existing scan chain. The scan data can be shifted in correctly and the primary outputs strobed. But the system clock cannot be applied to both negative and positive edge-triggered flops at the same time. Therefore, the test tool has to be run twice, once for each type of flop. Two sets of test vectors will be generated and run together during chip fabrication.
DFT Rule 9: Try not to multiplex PLL test pins with design primary I/O pins to reduce pin count.
Most ASIC vendors provide their own PLL block and associated self-test that requires several primary I/O pins. Many designers like to multiplex these test pins with design primary I/O pins to reduce pin count (if performance degradation is not an issue), because PLLs can never be tested when the ASIC is functionally operating. Figure 11 shows this design practice. Faults on pins i1, s and z are not testable. One solution is to insert a scan flop between the PLL block and pin i0 of the multiplexer. Consider DFT consequences when you make a design decision such as reducing pin count.
DFT Rule 10: Do not use bidirectional signals as pure unidirectional ones. Figure 12 shows an example of this.
DFT Rule 11: If your gate-level netlist is in Verilog and if your test tool directly uses this netlist to generate ATPG vectors, then do not use any “assign” statements in it because some test tools automatically translate each assign statement into a pseudo buffer cell in the ATPG database. This pseudo cell is seen as an element of the netlist and its “faults” are targeted by ATPG, causing the target fault list to be different from the actual design. Unfortunately, most such faults are ATG_UNTESTABLE and it takes time to identify them. An exception to Rule 11 is the use of “assign” statements in VDD and GND definitions, such as the following:
assign VDD = 1'b1;
assign GND = 1'b0;
Some back-end process tools automatically insert the words VDD and GND into the gate-level netlist they generate. If the netlist does not have the two “assign” statements above, then it cannot be simulated.
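A quick way to check a netlist against Rule 11 is to list every assign statement other than the VDD and GND definitions. The following Python sketch does this line by line, assuming each assign statement starts on its own line. It does not handle comments, escaped identifiers or statements split across lines, and the net names in the sample are made up.

```python
import re

# Matches a continuous assignment and captures its left-hand-side net name.
ASSIGN_RE = re.compile(r"^\s*assign\s+(\w+)")
ALLOWED = {"VDD", "GND"}  # supply definitions are the stated exception

def find_offending_assigns(netlist_text):
    """Return (line number, text) for each assign not defining VDD or GND."""
    hits = []
    for number, line in enumerate(netlist_text.splitlines(), start=1):
        m = ASSIGN_RE.match(line)
        if m and m.group(1) not in ALLOWED:
            hits.append((number, line.strip()))
    return hits

# Hypothetical netlist fragment; the net names n12 and n7 are made up.
netlist = """\
assign VDD = 1'b1;
assign GND = 1'b0;
assign n12 = n7;
"""
print(find_offending_assigns(netlist))  # [(3, 'assign n12 = n7;')]
```

Each reported line is a candidate pseudo buffer cell whose faults may appear in the ATPG fault list without existing in the actual design.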
Now we have finished discussing important DFT rules. Of course, a complete list of DFT rules is beyond the scope of this discussion, but those that we have discussed are common in real-world designs. If you follow them, you will have a very easy time with scan testing; if you don't, you won't. Some of the problems discussed so far cannot be solved unless you change the design. But design change during or after synthesis is always costly in time and resources.
Untestable faults
Now you know the most common DFT problems seen in real-world designs and how to perform fault-coverage analysis to identify the reasons for untestable faults.
The next logical question has to be what we can do about those untestable faults in order to raise fault coverage to beyond 95 percent. For each uncovered fault, you can do one of the fol